47 research outputs found

    A text-mining system for extracting metabolic reactions from full-text articles

    Background: Increasingly, biological text-mining research is focusing on the extraction of complex relationships relevant to the construction and curation of biological networks and pathways. However, one important category of pathway, metabolic pathways, has been largely neglected. Here we present a relatively simple method for extracting metabolic reaction information from free text that scores different permutations of assigned entities (enzymes and metabolites) within a given sentence based on the presence and location of stemmed keywords. This method extends an approach that has proved effective in the context of the extraction of protein–protein interactions. Results: When evaluated on a set of manually curated metabolic pathways using standard performance criteria, our method performs surprisingly well. Precision and recall rates are comparable to those previously achieved for the well-known protein–protein interaction extraction task. Conclusions: We conclude that automated metabolic pathway construction is more tractable than has often been assumed, and that (as in the case of protein–protein interaction extraction) relatively simple text-mining approaches can prove surprisingly effective. It is hoped that these results will provide an impetus to further research and act as a useful benchmark for judging the performance of more sophisticated methods that are yet to be developed
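
The idea of scoring permutations of entities by keyword presence and location can be illustrated with a toy sketch. This is not the authors' implementation: the keyword stems, weights, the crude prefix "stemmer" and the order bonus are all assumptions invented here, and enzymes are left out for brevity.

```python
from itertools import permutations

# Hypothetical keyword stems and weights, invented for illustration only.
KEYWORD_STEMS = {"convert": 2.0, "catalys": 2.0, "produc": 1.5, "form": 1.0}


def stem(token):
    """Crude stand-in for a real stemmer: return the first matching keyword stem."""
    token = token.lower()
    for s in KEYWORD_STEMS:
        if token.startswith(s):
            return s
    return None


def score_assignment(tokens, substrate_idx, product_idx):
    """Score one (substrate, product) ordering from keyword presence and location."""
    score = 0.0
    lo, hi = sorted((substrate_idx, product_idx))
    for i, tok in enumerate(tokens):
        s = stem(tok)
        if s is None:
            continue
        weight = KEYWORD_STEMS[s]
        # Assumed rule: a keyword between the two metabolites counts in full,
        # a keyword elsewhere in the sentence counts half.
        score += weight if lo < i < hi else 0.5 * weight
    # Assumed rule: a substrate mentioned before the product earns a small bonus.
    if substrate_idx < product_idx:
        score += 0.5
    return score


def best_reaction(tokens, metabolite_idxs):
    """Score every ordered pair of metabolites and return the best one."""
    candidates = [
        (score_assignment(tokens, s, p), tokens[s], tokens[p])
        for s, p in permutations(metabolite_idxs, 2)
    ]
    return max(candidates)


tokens = "glucose is converted to glucose-6-phosphate by hexokinase".split()
result = best_reaction(tokens, metabolite_idxs=[0, 4])
print(result)
```

Here the ordering (glucose as substrate, glucose-6-phosphate as product) wins because the keyword sits between the two metabolites and the substrate comes first. A real system would also assign the enzyme and use a proper stemmer.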

    Using structural motifs to identify proteins with DNA binding function

    This work describes a method for predicting DNA binding function from structure using 3-dimensional templates. Proteins that bind DNA using small contiguous helix–turn–helix (HTH) motifs comprise a significant number of all DNA-binding proteins. A structural template library of seven HTH motifs has been created from non-homologous DNA-binding proteins in the Protein Data Bank. The templates were used to scan complete protein structures using an algorithm that calculated the root mean squared deviation (rmsd) for the optimal superposition of each template on each structure, based on Cα backbone coordinates. Distributions of rmsd values for known HTH-containing proteins (true hits) and non-HTH proteins (false hits) were calculated. A threshold value of 1.6 Å rmsd was selected that gave a true hit rate of 88.4% and a false positive rate of 0.7%. The false positive rate was further reduced to 0.5% by introducing an accessible surface area threshold value of 990 Å² per HTH motif. The template library and the validated thresholds were used to make predictions for target proteins from a structural genomics project
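
How a cutoff turns two rmsd distributions into a true hit rate and a false positive rate can be sketched as follows. The rmsd values below are invented toy data, not the paper's, and treating a hit as rmsd less than or equal to the cutoff is an assumption; the superposition step that produces the rmsd values is not shown.

```python
def hit_rates(true_rmsds, false_rmsds, threshold):
    """Return (true hit rate, false positive rate) for an rmsd cutoff in Å.

    Assumption: a structure counts as a hit when its best template
    rmsd is less than or equal to the cutoff.
    """
    true_hits = sum(1 for r in true_rmsds if r <= threshold)
    false_hits = sum(1 for r in false_rmsds if r <= threshold)
    return true_hits / len(true_rmsds), false_hits / len(false_rmsds)


# Invented best-template rmsd values (Å) for illustration only.
hth_rmsds = [0.6, 0.9, 1.1, 1.4, 1.9]
non_hth_rmsds = [1.5, 2.2, 2.8, 3.1, 3.6, 4.0, 4.4, 5.0]

for cutoff in (1.0, 1.6, 2.0):
    tpr, fpr = hit_rates(hth_rmsds, non_hth_rmsds, cutoff)
    print(f"cutoff {cutoff:.1f} Å: true hit rate {tpr:.1%}, false positive rate {fpr:.1%}")
```

Raising the cutoff recovers more true HTH proteins at the cost of more false hits, which is the trade-off behind choosing a single threshold.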

    Computational modelling of the binding of arachidonic acid to the human monooxygenase CYP2J2

    An experimentally determined structure for human CYP2J2—a member of the cytochrome P450 family with significant and diverse roles across a number of tissues—does not yet exist. Our understanding of how CYP2J2 accommodates its cognate substrates and how it might be inhibited by other ligands thus relies on our ability to computationally predict such interactions using modelling techniques. In this study we present a computational investigation of the binding of arachidonic acid (AA) to CYP2J2 using homology modelling, induced fit docking (IFD) and molecular dynamics (MD) simulations. Our study reveals a catalytically competent binding mode for AA that is distinct from a recently published study that followed a different computational pipeline. Our proposed binding mode for AA is supported by crystal structures of complexes of related enzymes to inhibitors, and evolutionary conservation of a residue whose role appears essential for placing AA in the right site for catalysis

    Molecular dynamics simulations of the interaction of wild type and mutant human CYP2J2 with polyunsaturated fatty acids

    Objectives: The data presented here is part of a study that was aimed at characterizing the molecular mechanisms of polyunsaturated fatty acid metabolism by CYP2J2, the main cytochrome P450 enzyme active in the human cardiovasculature. This part comprises the molecular dynamics simulations of the binding of three eicosanoid substrates to wild type and mutant forms of the enzyme. These simulations were carried out with the aim of dissecting the importance of individual residues in the active site and the roles they might play in dictating the binding and catalytic specificity exhibited by CYP2J2. Data description: The data comprise: a) a new homology model of CYP2J2, b) a number of predicted low-energy complexes of CYP2J2 with arachidonic acid, docosahexaenoic acid and eicosapentaenoic acid, produced with molecular docking and c) a series of molecular dynamics simulations of the wild type and four mutants interacting with arachidonic acid as well as simulations of the wild type interacting with the two other eicosanoid ligands. The simulations may be helpful in identifying the determinants of substrate specificity of this enzyme and in unraveling the role of individual mutations on its function. They may also help guide the generation of mutants with altered substrate preferences

    Molecular docking for substrate identification: the short-chain dehydrogenases/reductases

    Protein–ligand docking has recently been investigated as a tool for protein function identification, with some success in identifying both known and unknown substrates of proteins. However, identifying a protein's substrate when cross-docking a large number of enzymes and their cognate ligands remains a challenge. To explore a more limited yet practically important and timely problem in more detail, we have used docking for identifying the substrates of a single protein family with remarkable substrate diversity, the short-chain dehydrogenases/reductases. We examine different protocols for identifying candidate substrates for 27 short-chain dehydrogenase/reductase proteins of known catalytic function. We present the results of docking more than 900 metabolites from the human metabolome to each of these proteins together with their known cognate substrates and products, and we investigate the ability of docking to (a) reproduce a viable binding mode for the substrate and (b) rank the substrate highly amongst the dataset of other metabolites. In addition, we examine whether our docking results provide information about the nature of the substrate, based on the best-scoring metabolites in the dataset. We compare two different docking methods and two alternative scoring functions for one of the docking methods, and we attempt to rationalise both successes and failures. Finally, we introduce a new protocol, whereby we dock only a set of representative structures (medoids) to each of the proteins, in the hope of characterising each binding site in terms of its ligand preferences, with a reduced computational cost. We compare the results from this protocol with our original docking experiments, and we find that although the rank of the representatives correlates well with the mean rank of the clusters to which they belong, a simple structure-based clustering is too naïve for the purpose of substrate identification. Many clusters comprise ligands with widely varying affinities for the same protein; hence important candidates can be missed if a single representative is used
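
The medoid-based protocol and its pitfall can be sketched with a toy cluster. The distances and docking ranks below are invented, and the ligand names are placeholders; the sketch only shows how a representative is picked and how a single representative can hide a well-ranked cluster member.

```python
def medoid(members, dist):
    """Return the cluster member with the smallest summed distance to all members."""
    return min(members, key=lambda m: sum(dist[m][o] for o in members))


def mean_rank(members, rank):
    """Average docking rank over a cluster (lower rank means better score)."""
    return sum(rank[m] for m in members) / len(members)


# Invented pairwise structural distances between three hypothetical ligands.
dist = {
    "A": {"A": 0.0, "B": 0.2, "C": 0.5},
    "B": {"A": 0.2, "B": 0.0, "C": 0.4},
    "C": {"A": 0.5, "B": 0.4, "C": 0.0},
}
# Invented docking ranks against one protein.
rank = {"A": 3, "B": 120, "C": 417}

cluster = ["A", "B", "C"]
rep = medoid(cluster, dist)
best = min(cluster, key=lambda m: rank[m])

print(f"medoid {rep} has rank {rank[rep]}, cluster mean rank {mean_rank(cluster, rank):.1f}")
print(f"best-ranked member {best} has rank {rank[best]}")
```

In this toy case the medoid B ranks 120th, so docking only the representative would miss ligand A, which ranks 3rd.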

    baerhunter: an R package for the discovery and analysis of expressed non-coding regions in bacterial RNA-seq data

    Summary: Standard bioinformatics pipelines for the analysis of bacterial transcriptomic data commonly ignore non-coding but functional elements e.g. small RNAs, long antisense RNAs or untranslated regions (UTRs) of mRNA transcripts. The root of this problem is the use of incomplete genome annotation files. Here, we present baerhunter, a coverage-based method implemented in R, that automates the discovery of expressed non-coding RNAs and UTRs from RNA-seq reads mapped to a reference genome. The core algorithm is part of a pipeline that facilitates downstream analysis of both coding and non-coding features. The method is simple, easy to extend and customize and, in limited tests with simulated and real data, compares favourably against the currently most popular alternative. Availability: The baerhunter R package is available from: https://github.com/irilenia/baerhunter Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online
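
A coverage-based search for expressed but unannotated regions might look roughly like the sketch below. It is written in Python for illustration and does not reproduce baerhunter's R API or its actual algorithm; the function name, the `min_cov` and `min_len` parameters and the coverage values are all hypothetical, and strand handling is ignored.

```python
def expressed_unannotated_regions(coverage, annotated, min_cov, min_len):
    """Find runs of well-covered positions that lie outside annotated features.

    coverage: per-base read depth along one strand (list of ints)
    annotated: half-open (start, end) intervals of known features
    Returns half-open (start, end) intervals at least min_len bases long.
    """
    masked = [False] * len(coverage)
    for start, end in annotated:
        for i in range(start, min(end, len(coverage))):
            masked[i] = True

    regions = []
    run_start = None
    for i, depth in enumerate(coverage):
        keep = depth >= min_cov and not masked[i]
        if keep and run_start is None:
            run_start = i
        elif not keep and run_start is not None:
            if i - run_start >= min_len:
                regions.append((run_start, i))
            run_start = None
    # Close a run that extends to the end of the sequence.
    if run_start is not None and len(coverage) - run_start >= min_len:
        regions.append((run_start, len(coverage)))
    return regions


# Invented per-base coverage and one annotated gene at positions 8 to 10.
coverage = [0, 0, 12, 15, 14, 13, 0, 0, 20, 22, 21, 19, 18, 2, 0]
annotated = [(8, 11)]
regions = expressed_unannotated_regions(coverage, annotated, min_cov=10, min_len=2)
print(regions)
```

In this toy input the first region is isolated from any gene, as a putative small RNA would be, while the second directly follows the annotated gene, as a UTR candidate would.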

    Dysregulation of alternative poly-adenylation as a potential player in Autism Spectrum Disorder

    We present here the hypothesis that alternative poly-adenylation (APA) is dysregulated in the brains of individuals affected by Autism Spectrum Disorder (ASD), due to disruptions in the calcium signaling networks. APA, the process of selecting different poly-adenylation sites on the same gene, yielding transcripts with different-length 3′ untranslated regions (UTRs), has been documented in different tissues, stages of development and pathologic conditions. Differential use of poly-adenylation sites has been shown to regulate the function, stability, localization and translation efficiency of target RNAs. However, the role of APA remains rather unexplored in neurodevelopmental conditions. In the human brain, where transcripts have the longest 3′ UTRs and are thus likely to be under more complex post-transcriptional regulation, erratic APA could be particularly detrimental. In the context of ASD, a condition that affects individuals in markedly different ways and whose symptoms exhibit a spectrum of severity, APA dysregulation could be amplified or dampened depending on the individual and the extent of the effect on specific genes would likely vary with genetic and environmental factors. If this hypothesis is correct, dysregulated APA events might be responsible for certain aspects of the phenotypes associated with ASD. Evidence supporting our hypothesis is derived from standard RNA-seq transcriptomic data but we suggest that future experiments should focus on techniques that probe the actual poly-adenylation site (3′ sequencing). To address issues arising from the use of post-mortem tissue and low numbers of heterogeneous samples affected by confounding factors (such as the age, gender and health of the individuals), carefully controlled in vitro systems will be required to model the effect of calcium signaling dysregulation in the ASD brain

    Visualisation of variable binding pockets on protein surfaces by probabilistic analysis of related structure sets

    Background: Protein structures provide a valuable resource for rational drug design. For a protein with no known ligand, computational tools can predict surface pockets that are of suitable size and shape to accommodate a complementary small-molecule drug. However, pocket prediction against single static structures may miss features of pockets that arise from proteins’ dynamic behaviour. In particular, ligand-binding conformations can be observed as transiently populated states of the apo protein, so it is possible to gain insight into ligand-bound forms by considering conformational variation in apo proteins. This variation can be explored by considering sets of related structures: computationally generated conformers, solution NMR ensembles, multiple crystal structures, homologues or homology models. It is non-trivial to compare pockets, either from different programs or across sets of structures. For a single structure, difficulties arise in defining a particular pocket’s boundaries. For a set of conformationally distinct structures the challenge is how to make reasonable comparisons between them given that a perfect structural alignment is not possible. Results: We have developed a computational method, Provar, that provides a consistent representation of predicted binding pockets across sets of related protein structures. The outputs are probabilities that each atom or residue of the protein borders a predicted pocket. These probabilities can be readily visualised on a protein using existing molecular graphics software. We show how Provar simplifies comparison of the outputs of different pocket prediction algorithms, of pockets across multiple simulated conformations and between homologous structures. We demonstrate the benefits of using multiple structures for protein-ligand and protein-protein interface analysis on a set of complexes and consider three case studies in detail: i) analysis of a kinase superfamily highlights the conserved occurrence of surface pockets at the active and regulatory sites; ii) a simulated ensemble of unliganded Bcl2 structures reveals extensions of a known ligand-binding pocket not apparent in the apo crystal structure; iii) visualisations of interleukin-2 and its homologues highlight conserved pockets at the known receptor interfaces and regions whose conformation is known to change on inhibitor binding. Conclusions: Through post-processing of the output of a variety of pocket prediction software, Provar provides a flexible approach to the analysis and visualization of the persistence or variability of pockets in sets of related protein structures
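
The per-residue probabilities described above can be thought of as the fraction of structures in which a residue borders a predicted pocket. The sketch below illustrates that idea only; it is not Provar's code, the residue labels are invented, and it assumes residue numbering is already consistent across the set, which for homologues would need an alignment step that is not shown.

```python
from collections import Counter


def pocket_lining_probabilities(pocket_residues_per_structure):
    """Fraction of structures in which each residue borders a predicted pocket.

    Each element of the input is the set of pocket-lining residues reported
    for one structure by some pocket prediction program.
    """
    n_structures = len(pocket_residues_per_structure)
    counts = Counter()
    for residues in pocket_residues_per_structure:
        counts.update(set(residues))  # count each residue once per structure
    return {res: counts[res] / n_structures for res in sorted(counts)}


# Invented pocket-lining residues for four conformers of one protein.
ensemble = [
    {"LEU45", "PHE101", "TYR105"},
    {"LEU45", "PHE101"},
    {"LEU45", "TYR105", "ARG143"},
    {"LEU45", "PHE101"},
]

probs = pocket_lining_probabilities(ensemble)
for residue, p in probs.items():
    print(f"{residue}: {p:.2f}")
```

Residues that never line a pocket are simply absent from the result. Values like these could be written to a per-residue column of a structure file and coloured in a molecular viewer.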

    KSHV SOX mediated host shutoff: the molecular mechanism underlying mRNA transcript processing.

    Onset of the lytic phase in the KSHV life cycle is accompanied by the rapid, global degradation of host (and viral) mRNA transcripts in a process termed host shutoff. Key to this destruction is the virally encoded alkaline exonuclease SOX. While SOX has been shown to possess an intrinsic RNase activity and a potential consensus sequence for endonucleolytic cleavage has been identified, the structures of the RNA substrates targeted remained unclear. Based on an analysis of three reported target transcripts, we were able to identify common structures and confirm that these are indeed degraded by SOX in vitro, as well as predict the presence of such elements in the KSHV pre-microRNA transcript K12-2. From these studies, we were able to determine the crystal structure of SOX productively bound to a 31-nucleotide K12-2 fragment. This complex not only reveals the structural determinants required for RNA recognition and degradation but, together with biochemical and biophysical studies, reveals distinct roles for residues implicated in host shutoff. Our results further confirm that SOX and the host exoribonuclease Xrn1 act in concert to elicit the rapid degradation of mRNA substrates observed in vivo, and that the activities of the two ribonucleases are co-ordinated

    Using a whole genome co-expression network to inform the functional characterisation of predicted genomic elements from Mycobacterium tuberculosis transcriptomic data

    A whole genome co-expression network was created using Mycobacterium tuberculosis transcriptomic data from publicly available RNA-sequencing experiments covering a wide variety of experimental conditions. The network includes expressed regions with no formal annotation, including putative short RNAs and untranslated regions of expressed transcripts, along with the protein-coding genes. These unannotated expressed transcripts were among the best-connected members of the module sub-networks, making up more than half of the ‘hub’ elements in modules that include protein-coding genes known to be part of regulatory systems involved in stress response and host adaptation. This dataset provides a valuable resource for investigating the role of non-coding RNA, and conserved hypothetical proteins, in transcriptomic remodelling. Based on their connections to genes with known functional groupings and correlations with replicated host conditions, predicted expressed transcripts can be screened as suitable candidates for further experimental validation